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Abstract 


We propose a new, nonparametric method for multivariate regression subject to convexity or con- 
cavity constraints on the response function. Convexity constraints are common in economics, 
statistics, operations research, financial engineering and optimization, but there is currently no 
multivariate method that is stable and computationally feasible for more than a few thousand ob- 
servations. We introduce convex adaptive partitioning (CAP), which creates a globally convex 
regression model from locally linear estimates fit on adaptively selected covariate partitions. CAP 
is a computationally efficient, consistent method for convex regression. We demonstrate empir- 
ical performance by comparing the performance of CAP to other shape-constrained and uncon- 
strained regression methods for predicting weekly wages and value function approximation for 
pricing American basket options. 

Keywords: adaptive partitioning, convex regression, nonparametric regression, shape constraint, 
treed linear model 


1. Introduction 


Consider the regression model for x € X C R? and y € R, 


y = fo(x) +€, 


where fo : RP — R and € is a mean 0 random variable. In this paper, we study the situation where 
fo 1s convex. That is, 


Afo(x1) + (1 —A) fo(x2) = fo(Axi + (1 —A)x2), 


for every X1,X2 € X and A € (0,1). Given the observations (x;,y1),.--,(Xn,¥n), we would like to 
estimate fo subject to the convexity constraint. Convex regression is easily extended to concave 
regression since a concave function is the negative of a convex function. 

Convex regression problems occur in a variety of settings. Economic theory often dictates 
that demand (Varian, 1982), production (Varian, 1984; Allon et al., 2007) and consumer prefer- 
ence (Boyd and Vandenberghe, 2004) functions are concave. In financial engineering, stock option 
prices often have convexity restrictions (Ait-Sahalia and Duarte, 2003). Stochastic optimization 
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problems in operations research and reinforcement learning can be solved with response surfaces 
(Lim, 2010) or value-to-go functions. These exhibit concavity in many settings, like resource al- 
location (Topaloglu and Powell, 2003; Powell, 2007; Toriello et al., 2010) or stochastic control 
(Keshavarz et al., 2011). Similarly, efficient frontier methods like data envelopment analysis (Kuos- 
manen and Johnson, 2010) include convexity constraints. In density estimation, shape restrictions 
like log-concavity provide flexible estimators without tunable parameters (Cule et al., 2010; Cule 
and Samworth, 2010; Schuhmacher and Diimbgen, 2010). Finally, in optimization, convex approxi- 
mations to polynomial constraints are valuable for geometric programming (Kim et al., 2004; Boyd 
et al., 2007; Magnani and Boyd, 2009). 

Although convex regression has been well explored in the univariate setting, the literature re- 
mains underdeveloped in the multivariate setting. Methods where an objective function is con- 
strained to the set of convex functions through supporting hyperplane constraints for each pair of 
observations (Hildreth, 1954; Holloway, 1979; Kuosmanen, 2008; Seijo and Sen, 2011; Lim and 
Glynn, 2012; Allon et al., 2007) or semidefinite constraints over all observations (Roy et al., 2007; 
Aguilera and Morin, 2008, 2009; Henderson and Parmeter, 2009; Wang and Ni, 2012) are too com- 
putationally demanding for more than a few thousand observations. 

In more recent approaches, different methods have been developed. Fitting a convex hull to a 
smoothed version of the data (Aguilera et al., 2011) scales to larger data sets, but is inefficient for 
more than 4 or 5 dimensions. Refitting a series of hyperplanes can be done in a frequentist (Magnani 
and Boyd, 2009) or Bayesian (Hannah and Dunson, 2011) manner. While the Bayesian method does 
not scale to more than a few thousand observations, the frequentist method scales to much larger 
data sets but can exhibit unstable behavior. Recent literature is more fully reviewed in Section 2. 

In this paper, we introduce the first computationally efficient and theoretically sound multivari- 
ate convex regression method: convex adaptive partitioning (CAP). It fits a series of hyperplanes to 
the data through adaptive partitioning. It relies on an alternate, first-order definition of convexity, 


fo(x1) > fo(x2) + go(x2)" (x1 — x2), (1) 


for every X,,X2 E€ X, where go(x) € Ofo(x) is a subgradient of fp at x. Equation (1) states that a 
convex function lies above all of its supporting hyperplanes, or subgradients tangent to fo. More- 
over, with enough supporting hyperplanes, fo can be approximately reconstructed by taking the 
maximum over those hyperplanes. 

The CAP estimator is formed by adaptively partitioning a set of observations in a method similar 
to trees with linear leaves (Chaudhuri et al., 1994). Within each subset of the partition, we fit a linear 
model to approximate the subgradient of fọ within that subset. Given a partition with K subsets and 
linear models, (x, Be) É], a continuous, convex (concave) function is then generated by taking the 
maximum (minimum) over the hyperplanes by 


T 
fax) = o Tad By x. 
The partition is refined by a twofold strategy. First, one of the subsets is split along a cardinal 
direction (say, x; or x3) to grow K. Then, the hyperplanes themselves are used to refit the subsets. 
A piecewise linear function like f,, induces a partition; a subset is defined as the region where a 
particular hyperplane is dominant. The refitting step places the hyperplanes in closer alignment with 
the observations that generated them. This procedure is repeated until all subsets have a minimal 
number of observations. The CAP estimator is then created by selecting the value of K that balances 
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fit with complexity using a generalized cross validation method (Golub et al., 1979; Friedman, 
1991). We show that CAP is consistent with respect to the fo metric. Because of the dramatic 
reduction in runtime, CAP opens a new class of problems for study, namely moderate to large 
problems with convexity or concavity constraints. 


2. Literature Review 


The literature for convex regression is scattered throughout a variety of fields, including statistics, 
operations research, economics numerical analysis and electrical engineering. Most methods are de- 
signed for the univariate setting, which is closely related to isotonic regression. Univariate methods 


rely on the ordering implicit to the real line. Setting x; < xj < xi+1 fori=2,...,n—1, 
foli) = foi) < folii) = fol) i=2.. n1, (2) 
Xi —Xi-1 Xi+1 — Xi 


is equivalent to Equation (1). When fo is differentiable, Equation (2) is equivalent to an increasing 
derivative function. 

The oldest and simplest solution method is the least squares estimator (LSE), which produces 
a piecewise linear estimator by solving a quadratic program with a least squares objective function 
subject to the constraints in Equation (2) (Hildreth, 1954; Dent, 1973). Although the LSE is com- 
pletely free of tunable parameters, the estimator is not smooth and can overfit in boundary regions. 
Consistency, rate of convergence, and asymptotic distribution were shown by Hanson and Pledger 
(1976), Mammen (1991) and Groeneboom et al. (2001), respectively. Algorithmic methods for 
solving the quadratic program were given in Wu (1982); Dykstra (1983) and Fraser and Massam 
(1989). 

Splines use linear combinations of basis functions to produce a smooth estimator; in univari- 
ate convex regression, an increasing function can be fit to the derivative of the original function. 
Meyer (2008) and Meyer et al. (2011) used convex-restricted splines with positive parameters in 
frequentist and Bayesian settings, respectively. Turlach (2005) and Shively et al. (2011) used unre- 
stricted splines with restricted parameters in frequentist and Bayesian settings, respectively. In other 
methods, Birke and Dette (2007) used convexity constrained kernel regression. Chang et al. (2007) 
used a random Bernstein polynomial prior with constrained parameters. Due to the constraint on 
the derivative of fo, univariate convex regression is quite similar to univariate isotonic regression; 
see Brunk (1955), Hall and Huang (2001), Neelon and Dunson (2004) and Shively et al. (2009) for 
examples. 

In the multivariate setting Equation (1) cannot be reduced to a set of n — 1 linear inequalities. 
Instead, it needs to hold for every pair of points. The multivariate least squares estimator Hildreth 
(1954); Holloway (1979) solves the quadratic program, 


n 
min $ (yi — Si)” (3) 
i=1 
subject to ~; > $i++ g (x; SK LJ = lee 


Here, 9; and g; are the estimated values of fo(x;) and the subgradient of fo at x;, respectively. The 
estimator f7°” is piecewise linear, 


fo" (x) = max $i +g (x—x;). 
i€{1,...,n} 
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The characterization (Kuosmanen, 2008) and consistency (Seijo and Sen, 2011; Lim and Glynn, 
2012) of the least squares problem have only recently been studied. The LSE quickly becomes 
impractical due to its size: Equation (3) has n(n — 1) constraints. This results in a computational 
complexity of O((p + 1)*n°) (Monteiro and Adler, 1989), which becomes impractical after one to 
two thousand observations. It can also severely overfit in boundary regions. In similar approach, 
Allon et al. (2007) proposed a method based on reformulating the maximum likelihood problem 
as one minimizing entropic distance, again subject to n? linear constraints generated by the dual 
problem. 


An alternative to first order constraints in Equation (1) 1s second order, or Hessian, constraints. 
Roy et al. (2007) and Aguilera and Morin (2008, 2009) solved a math program with a least squares 
objective function and semidefinite constraints through semidefinite programming. Henderson and 
Parmeter (2009) used kernel smoothing with a restricted Hessian and found a solution with sequen- 
tial quadratic programming. While these methods are consistent in some cases (Aguilera and Morin, 
2008, 2009), they are computationally infeasible for more than about a thousand observations. 


Recently, multivariate convex regression methods have been proposed with different approaches. 
Aguilera et al. (2011) proposed a two step smoothing and fitting process. First, the data were 
smoothed and functional estimates were generated over an €-net over the domain. Then the convex 
hull of the smoothed estimate was used as a convex estimator. Again, although this method is con- 
sistent, it is sensitive to the choice of smoothing parameter and does not scale to more than a few 
dimensions. Hannah and Dunson (2011) proposed a Bayesian model that placed a prior over the set 
of all piecewise linear models. They were able to show adaptive rates of convergence, but the in- 
ference algorithm did not scale to more than a few thousand observations. Koushanfar et al. (2010) 
transformed the ordering problem associated with shape constrained inference into a combinatorial 
optimization problem which was solved with dynamic programming; this scales to a few hundred 
observations. 


The work that is closest to CAP is an iterative fitting scheme of Magnani and Boyd (2009). In 
this method, the data were divided into K random subsets and a linear model was fit within each 
subset; a convex function was generated by taking the maximum over these hyperplanes. This new 
function induced a partition over the covariate space, which generated a new collection of K subsets. 
Again, linear models were fitted and another convex function was produced by taking the maximum 
over the new hyperplanes. This sequence was repeated until convergence. Although this method 
usually produces a high quality estimate, it does not always converge and can be unstable. 


3. Convex Adaptive Partitioning 


A natural way to model a convex function fo is through the maximum of a set of K hyperplanes. We 
do this by partitioning the covariate space and approximating the gradients within each region by 
hyperplanes generated by the least squares estimator. The covariate space partition and K are chosen 
through adaptive partitioning. Given a partition {A,,...,Ax«} of X, an estimate of the gradient for 
each subset can be created by taking the least squares linear estimate based on all of the observations 
within that region, 


(O%,B.) =argmin }, (yi—a-— BTX)”. 


OP i:x;EA, 
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A convex function f can be created by taking the maximum over (Qk, Be) ; 


a T 
ee. ee Ox + B; x. 

Adaptive partitioning models with linear leaves have been proposed before; see Chaudhuri et al. 
(1994), Chaudhuri et al. (1995), Alexander and Grimshaw (1996), Nobel (1996), Dobra and Gehrke 
(2002), Gyorfi et al. (2002) and Potts and Sammut (2005) for examples. In most of these cases, 
the partition is created by adaptively refining an existing partition by dyadic splitting of one subset 
along one dimension. That is, all data is initially placed within a single subset, which is then split 
into two new subsets along a single dimension, for example at xı = 5. The split dimension and 
value is chosen in a way that minimizes local error within the subset, through impurity (Chaudhuri 
et al., 1994) or mean squared error minimization (Alexander and Grimshaw, 1996). The SUPPORT 
algorithm of Chaudhuri et al. (1994) computes test statistics for the difference between the means 
and variances of the residuals and selects the split with the smallest associated p—value. Splitting 
is continued within a subset until a terminal level of purity or a minimal number of observations is 
reached in that subset; however, SUPPORT uses a cross-validation based method as a stopping rule. 
Once a full tree has been created, it is pruned using a variety of cross-validation based methods that 
aim to remove individual leaves or branches to produce the most simple tree that represents the data 
well; see Breiman et al. (1984) and Quinlan (1993) for pruning methods. 

There are two problems that arise when a piecewise linear additive function, 


K 
f (x) a ` (ax + B: x) lizel}, 
k=1 


is changed into a piecewise linear maximization function, like f. First, a split that minimizes local 
error does not necessarily minimize global error for f. This is easily remedied by selecting splits 
based on minimizing global error. The second problem is more difficult: the linear models often act 
in areas over which they were not estimated. 

The piecewise linear max function, f,, generates a new partition, {A}, ..., A% }, by 


Ay, ={xEX: Oj + BEX > Oj; + Bix, V JAK}. 


The partition {A,,...,A«} is not necessarily the same as {A,...,A;}. We can use this new partition 
to refit the hyperplanes and produce a significantly better estimate. A graphical representation is 
given in Figure 1. 

Refitting hyperplanes in this manner can be viewed as a Gauss-Newton method for the non- 
linear least squares problem (Magnani and Boyd, 2009), 


n 2, 
minimize ` (x — max (au + Bf) ) . 


p ke{1,...,K} 


Similar methods for refitting hyperplanes have been proposed in Breiman (1993) and Magnani and 
Boyd (2009). However, repeated refitting may not converge to a stationary partition and is sensitive 
to the initial partition. 

Convex adaptive partitioning uses adaptive partitioning with linear leaves to fit a convex function 
that is defined as the maximum over the set of leaves. The adaptive partitioning itself differs from 
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a, + B, x! 





Figure 1: The original space partition A and accompanying data partition C with hyperplanes fit 
according to that partition (left), the convex estimator based on those hyperplanes; some 
points are not represented by the hyperplane they were used to fit (center), and subsets 
refit based on the hyperplanes (right). 


previous methods in order to fit piecewise linear maximization functions. Partitions are refined in 
two steps. First, candidate splits are generated through dyadic splits of existing partitions. These 
are evaluated and the one that minimizes global error is greedily selected. Second, the new partition 
is then refit. Although simple, these rules, and refitting in particular, produce large gains over naive 
adaptive partitioning methods; empirical results are discussed in Section 6. 

Most other adaptive partitioning methods use backfitting or pruning to select the tree or partition 
size. Due to the construction of the CAP estimator, we cannot locally prune and so instead we rely 
on model selection criteria. We derive a generalized cross-validation method for this setting that is 
used to select K. This is discussed in Section 5. 


3.1 The Algorithm 


We now introduce some notation required for convex adaptive partitioning. When presented with 
data, a partition can be defined over the covariate space (denoted by {A,,...,Ax}, with Ag C X) 
or over the observation space (denoted by {C},...,Cx}, with Ck C {1,...,n}). The observation 
partition is defined from the covariate partition, 


C; = {i : x; EAk}, k=1,...,K. 


The relationship between these is shown in Figure 1. CAP proposes and searches over a set of 
models, Mı,...,Mg. A model Mx is defined by: 1) the covariate partition {A1,...,Ag}, 2) the 
corresponding observation partition, {C},...,Cx}, and 3) the hyperplanes (a;,B i= fit to those 
partitions. 

The CAP algorithm progressively refines the partition until each subset cannot be split without 
one subset having fewer than a minimal number of observations, nmin- This value is chosen to 
balance increasing model complexity against accurate local model fit and computational complexity. 
When a relatively small number of observations is used to fit local linear models, the local models 
tend to fit noise. This is particularly problematic with linear models, which can predict extreme 
values based on overfit models. The issue is aggravated when the estimator is defined as a max over 
local linear models, which can be dominated by a few extreme values; it can cause instability in the 
estimator of Magnani and Boyd (2009). Therefore, we choose a conservative value for nmin, which 
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admits logarithmic partition growth, 


Amin = min f Ba. 2ld+ D} 


Dlog(n) 


Here D is a log scaling factor, which acts to change the base of the log operator. We briefly outline 
the CAP algorithm below. 


3.1.1 CONVEX ADAPTIVE PARTITIONING (CAP) 


1. Initialize. Set K = 1; place all observations into a single observation subset, Cy = {1,...,}; 
A; = X; this defines model M4. 


2. Split. Refine partition by splitting a subset. 


a. Generate candidate splits. Generate candidate model My je by 1) fixing a subset k, 2) fixing 
a dimension j, 3) dyadically dividing the data in subset k and dimensions j according to 
knot ag. This is done for L knots, all p dimensions and K subsets. 


b. Select split. Choose the model Mx; from the candidates that minimizes global mean 
squared error on the training set and satisfies ming |Cx| > nmin. Set K = K +1. 


3. Refit. Use the partition induced by the hyperplanes to generate model Mp. Set Mx = Mk if 
for every subset C, in Mg, |C,| > nmin. 





4. Stopping conditions. If for every subset Ck in Mz, |Ck| < 2nmin, stop fitting and proceed to 


step 5. Otherwise, go to step 2. 


5. Select model size. Each model M% creates an estimator, 


Le) nae Ot Bj x. 


Use generalized cross-validation on the estimators to select final model M* from {M,}_,. 


3.2 Splitting Rules 


To split, we create a collection of candidate models by splitting a single subset into two subsets. We 
create models for every subset and search along every cardinal direction by splitting the data along 
that direction. For a fixed dimension j and subset k, let xi be the minimum value and wk. be the 
maximum value of the covariates in this subset and dimension. Let 0 < aj < -+ < az < 1 bea set 


of evenly spaced knots that represent the proportion between xe and Xx. 


We create model M jxo by 1) fixing subset k € {1,...,K}, and 2) fixing dimension j € {1,..., p}. 


ik : es ik ay 
I = min{x;; ELE Gel. ce = max { Xj; IE CEN: 
Use the weighted average bige = agx”, +(1— ap) xP hax to split C; and A; in dimension j. Set 
C =AL E Ce Xij S bike}, Csi = {i:i € Cp, Xij > bike}, 
Ay = {X : X € Ag, xj < bire}, Ags, = {X : X E Ag, xj > bike} 
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(A) Original Partition 
(C) k=2 





Figure 2: (A) The original observation partition C for M2, (B) new splits generated from the subset 
Cı, and (C) new splits generated from the subset Cz. Since there is only one dimension, 
we fix j= 1. 


These define new subset and covariate partitions, C}.,,,; and Aj..,,; where Cy = Cy and Cy = Cy 


for k Æ k. See Figure 2 for an example. Fit hyperplanes (Gx, Peo in each of the subsets. The 
triplet of observation partition C}.,,;, covariate partition, A}.,,,, and set of hyperplanes (Gx, crea 
defines the model Mine: This is done for k = 1,...,K, j=1,...,pand@=1,...,L. After all models 
are generated, set K = K +1. 

We note that any models where ming IC] < Nmin are discarded. If all models are discarded in 


one subset/dimension pair, we produce a model by splitting on the subset median in that dimension. 


3.3 Split Selection 


We select the model M ke that gives the smallest global error. Let (a! B! ae , be the hyperplanes 
associated with M’,, and let 
jkl 


FIX) = max o 
f ) Ellek 7 


oT 


be its estimator. We set the model Mx to be the one that minimizes global mean squared error, 


jkén:= 


a OIB aes? 
Mx = g : (J,k,4) = argmin— È (yi — f(x) l 
i=l 


Set fr to be the minimal estimator. We note that Mx may not be unique, however this seldom occurs 
in practice. 
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3.4 Refitting 
We refit by using the partition induced by the hyperplanes. Let (œ1:x,B1:x) be the hyperplanes 
associated with Mx. Refit the partitions by 

G = fxi : Oy + Bix; > oj + Bix), J = a k} 


for k =1,...,K. The covariate partition, Aj. is defined in a similar manner. Fit hyperplanes in 
each of those subsets. Let My be the model generated by the partition C),...,Cy. Set Mg = Mx if 
IC] > nmin for all k. 


3.5 Stopping Criteria 


Stopping criteria are similar to those in tree-based models (Nobel, 1996; Gyorfi et al., 2002). That 
is, the model stops when there are not enough observations within each subset of leaf to generate 
any further candidate splits, 

[Ck < 2Nmin 


fork =1,...,K. After fitting to termination the final model size, however, is chosen through a 
pruning method discussed Section 5. 


3.6 Tunable Parameters 


CAP has two tunable parameters, L and nmin. L specifies the number of knots used when generating 
candidate models for a split. Its value is tied to the smoothness of fọ and after a certain value, 
usually 5 to 10 for most functions, higher values of L offer little fitting gain. 

We choose a minimal subset size, nmin, that admits at most O(log(n))subsets. A parameter D 
is used to specify a minimum subset size, nmin = n/(Dlog(n)). Here D transforms the base of the 
logarithm from e into exp(1/D). We have found that D = 3 (implying base ~ 1.4) is a good choice 
for most problems. 

Increases in either of these parameters increase the computational time. Sensitivity to these 
parameters, both in terms of predictive error and computational time, is empirically examined in 
Appendix B. 


3.7 Computational Efficiency 


Each round of CAP requires O(dKL) regressions to be fit for model proposal. Since observations 
are moved from one side of a threshold to another within each leaf, an efficient method 1s to maintain 
and update parameters and the sum of squares and cross products within each leaf. Alternately, a 
QR decomposition may be maintained and updated for each leaf (Alexander and Grimshaw, 1996). 
Unlike treed linear models, all linear models need to be refit for each round of CAP. 


4. Consistency 


Consistency for CAP can be shown in a related manner to consistency for other adaptive partitioning 
models, like CART (Breiman et al., 1984), treed linear models (Chaudhuri et al., 1994) and other 
variants (Nobel, 1996; Gyorfi et al., 2002). We take a two-step approach, first showing consistency 
for the mean function and first derivatives of a more traditional treed linear model based on CAP 
under the l~ metric and then we use that to show consistency for the CAP estimator itself. 
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Letting M% be the model for the CAP estimate after n observations, define the discontinuous 
piecewise linear estimate based on M7, 


Kn 
= 24 Ok + Bix ) Leap} 


where K, is the partition size, A;,...,Ax, are the covariate partitions and (Ox, Bi). ” , are the hyper- 
planes associated with M;. Let f,(x be the CAP estimator based on M%, 
T 
x)= max Of¢+)p;xX. 
falx) kE{1,... Kn} Pi 
Each subset A; has an associated diameter, dnk = SUP,, xc, ||%1 —X2||2- Define the empirical co- 
variate mean for subset k as x, = Gl icc, Xi. For x; € Ax, define 


Hood | : 
I; = a = ‘ G — I;I; 3 
Poar r=) 


1ECx 


Note that (oO, Bx) = G;,' Yiec, liyi whenever Gg is nonsingular. 
Let x;,...,X, be 1.1.d. random variables. We make the following assumptions: 


Al. X is compact and fo is Lipschitz continuous and continuously differentiable on X with Lips- 
chitz parameter C. 


A2. There is an a > 0 such that E [e“”~ 40“! | X = x] is bounded on X. 


A3. Let Ax be the smallest eigenvalue of |C,|~'Gy and A, = ming Ax. Then Àn, remains bounded 
away from 0 in probability as n — œ. 


A4. The diameter of the partition max, a — 0 in probability as n — œ. 


A5. The number of observations in each subset satisfies ming=1....x, |Cx| > d7! ,/nlog(n) in prob- 
ability as n — œ. 


Assumptions Al. and A2. place regularity conditions on fo and the noise distribution, re- 
spectively. Assumption A3. is a regularity condition on the covariate distribution to ensure the 
uniqueness of the linear estimates. Assumption A4. is a condition that can be included in the algo- 
rithm and checked along with the subset cardinality, |C;,|. If X is given, it can be computed directly, 
otherwise it can be approximated using {x; : i E€ Cz}. Assumption AS. ensures that there are enough 
observations in the terminal nodes to fit the linear models. 

To show consistency of f„ under the l. metric, we first show consistency of f; and its derivatives 
under the 4» metric in Theorem 1. This is similar to Theorem 1 of Chaudhuri et al. (1994) for treed 
linear models, although we need to modify it to allow partitions with an arbitrarily large number of 
faces. 


Theorem 1 Suppose that assumptions A1. through A5. hold. Then, 


onan sup [ax +B; x — fo(x)| — 0, „max sup ||Bx — Vfo(x)||.. + 0 


=1,... Kn xEA, =1,..., nXCA, 


in probability as n —> œ. 
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The CAP algorithm is similar to the SUPPORT algorithm of Chaudhuri et al. (1994), except the 
refitting step of CAP allows partition subsets to be polyhedra with up to K, faces. Theorem 1 is 
analogous to Theorem | of Chaudhuri et al. (1994); to prove our theorem, we modify parts of the 
proof in Chaudhuri et al. (1994) that rely on a fixed number of polyhedral faces. The proof is given 
in Appendix A. 

Using the results from Theorem 1, extension to consistency for f, under the Z. metric is fairly 
simple; this is given in Theorem 2. 


Theorem 2 Suppose that assumptions A1. through A5. hold. Then, 


sup | fn(x) — fo(x)| —> 0 


xEX 


in probability as n — œ. 


The proof follows immediately from Theorem 1 and some algebra. Details are given in the Ap- 
pendix A. 


5. Generalized Cross- Validation 


The terminal model produced by CAP can overfit the data. As a fast approximation to leave-one-out 
cross-validation, we use generalized cross-validation (GCV) (Golub et al., 1979; Friedman, 1991) 
to select the best model from all of those produced by CAP, M,,...,Mx. A given model Mx is 
generated by a collection of K linear models. In linear regression, GCV relies on the following 
approximation 


ly Ly (Yiz fal ne) Lys (yim Fulxi) \” 
-E 0i- filmi) = = we (Se | 4 
Boner (an) = ony o 
where H; is the i” diagonal element of the hat matrix, X(X’7X)~'X’, ÎL; is the estimator condi- 
tioned on all of the data minus element i. We note that Tr(H) is sometimes approximated by the 
degrees of freedom divided by the number of observations. 

The model Mx is defined by Cj,.. ou the a and the hyperplanes (Ox, Bx)¢_,, which 


were generated by the partition. Let (at? po \K ;_1 be the collection of hyperplanes generated 
when observation i is removed; notice that if i € Cy, only (Ox, Bg) changes. Let f ig be the estimator 
for model Mg with observation i removed. Using the derivation in Equation (4), 


n n 2 
TE oif) = EE (vi- max al dap Le 


nh n& kefl,....K} 
T 2 
Yi — Akli) — Pra Xi 


Il 
Sle 


~ 
| 
a) 


l Fi Lijec) 
2 
Yi — Aki) — Bko) Xi (5) 
1— Tr(H*( Dikey í 
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Figure 3: Log-continuous plots for number of observations vs. K (top), MSE (middle), and run- 
time in seconds (bottom) for GCV, 5-fold and 10-fold cross validation, plus/minus one 
standard error. Data were generated from 10 i.i.d. training sets with x ~ N5(0,/), 
y = (x1 + 5x2 +43)" — x44 .25xz +g, and € ~ N(0, 1). 


where, in a slight abuse of notation, H% is the diagonal entry of the hat matrix for subset k corre- 
sponding to element i, and 


ki) are max OTB 
ke{1,...K} 1 — TA) Liec} 


3212 


MULTIVARIATE CONVEX REGRESSION WITH ADAPTIVE PARTITIONING 


To select K, we find the K that minimizes the right hand side of Equation (5). Although more 
computationally intensive than GCV in linear models, the computational complexity for CAP GCV 
is similar to that of the CAP split selection step. 

We empirically compared GCV selection of K with 5- and 10-fold cross validation selection of 
K. GCV tends to select a smaller K than full cross validation, particularly on smaller problems. 
Predictive results, however, are comparable for moderate to large problem sizes (n > 5,000) while 
the runtime of GCV is orders of magnitude less than 5- and 10-fold cross validation. We should 
expect more discrepancy between cross-validation and GCV on smaller problems because GCV 
relies on an asymptotic approximation. In these cases, full cross validation selection of K may be 
worthwhile. Representative results are given in Figure 3. 

We can use generalized cross-validation to create a more efficient stopping rule for CAP. We 
note that GCV scores are often unimodal in K. Instead of fully growing the tree, we stop splitting 
after the score has increased twice in a row. The resulting algorithm is called Fast CAP; details are 
given in Appendix B. 


6. Empirical Analysis 


We compare shape constrained and unconstrained regression methods across a set of convex re- 
gression problems: two synthetic regression problems, predicting mean weekly wages and value 
function approximation for pricing basket options. 


6.1 Synthetic Regression Problems 


We apply CAP to two synthetic regression problems to demonstrate predictive performance and 
analyze sensitivity to tunable parameters. The first problem has a non-additive structure, high levels 
of covariate interaction and moderate noise, while the second has a simple univariate structure em- 
bedded in a higher dimensional space and low noise. Low noise or noise free problems often occur 
when a highly complicated convex function needs to be approximated by a simpler one (Magnani 
and Boyd, 2009). 


6.1.1 PROBLEM 1 


Here x € R°. Set 
y = (x1 + 5x2 +x3)° — x4 + 25x + €, 


where € ~ N(0,1). The covariates are drawn from a 5 dimensional standard Gaussian distribution, 
Ns(0,J). 
6.1.2 PROBLEM 2 
Here x € R!. Set 
y =exp(x'q) +e, 

where q was randomly drawn from a Dirichlet(1,...,1) distribution, 

q= (0.0680, 0.0160, 0.1707, 0.1513, 0.1790, 0.2097, 0.0548, 0.0337 ,0.0377,0.0791 )” . 
We set € ~ N(0,0.17). The covariates are drawn from a 10 dimensional standard Gaussian distribu- 


tion, Ni0(0, I) . 
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6.1.3 PREDICTIVE PERFORMANCE AND RUNTIMES 


We compared the performance of CAP and Fast CAP to other regression methods on problems 1 and 
2. We implemented the following shape constrained algorithms: the least squares regression (LSE) 
using cvx (Grant and Boyd, 2012, 2008), and the linear refitting algorithm of Magnani and Boyd 
(2009). The general methods included Gaussian processes (Rasmussen and Williams, 2006) using 
gpm1 in Matlab, tree regression with constant values in the leaves using classregtree in Matlab, 
multivariate adaptive regression splines (MARS) (Friedman, 1991) using ARES1ab in Matlab, and 
support vector machines (SVMs) using the e1071 package in R. 


For CAP and Fast CAP, we set the parameters to D = 3 and L = 10; the sensitivity to these 
parameters is examined in Appendix C. The parameter K was chosen by GCV for CAP. In Fast 
CAP, the number of random search directions was set to be p = min(d, 10). All methods were given 
a maximum runtime of 90 minutes, after which the results were discarded. Methods were run on 
10 random training sets and tested on the same testing set of 10,000 random covariates. Average 
runtimes and predictive performance are shown in Figure 4. 


Non-convex regression methods performed poorly compared to shape restricted methods, par- 
ticularly in the higher noise setting. Amongst the shape restricted methods, only CAP and Fast 
CAP had consistently low predictive error. The method of Magnani and Boyd (2009) can become 
unstable, which is seen in problem 1. Surprisingly, the LSE had high predictive error. This can be 
attributed to overfitting, particularly in the boundary regions. A demonstration is given in Figure 
5. Although CAP and Fast CAP had similar predictive performance, their runtimes often differed 
by an order of magnitude with the largest differences on the biggest problem sizes. Based on this 
performance, we would suggest using Fast CAP on larger problems. 


We note that the empirical rate of convergence for CAP and Fast CAP is much faster than would 
be predicted by minimax convergence rates. The results, however, are consistent with rates that 
adapt to an underlying linear subspace; this is examined in Appendix D. 


6.1.4 CAP AND TREED LINEAR MODELS 


Treed linear models are a popular method for regression and classification. They can be easily 
modified to produce a convex regression estimator by taking the maximum over the linear leaves. 
CAP differs from existing treed linear models in how the partition is refined. First, subset splits 
are selected based on global reduction of error. Second, the partition is refit after a split is made. 
To investigate the contributions of each step, we compare to treed linear models generated by: 1) 
local error reduction as an objective for split selection and no refitting, 2) global error reduction as 
an objective function for split selection and no refitting, and 3) local error reduction as an objective 
for split selection along with refitting. All estimators based on treed linear models are generated by 
taking the maximum over the set of linear models in the leaves. We compared the performance of 
these methods on problems 1 and 2 over 10 different training sets and a single testing set. Average 
predictive error is displayed in Figure 6. 


Global split selection and refitting are both beneficial, but in different ways. Refitting dramat- 
ically reduces predictive error, but can add variance to the estimator in noisy settings. Global split 
selection modestly reduces predictive error but can reduce variance in noisy settings, like problem 
1. The combination of the two produces CAP, which has both low variance and high predictive 
accuracy. 
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Figure 4: Mean squared error (top) and runtime in seconds (bottom) plus/minus one standard error 
on problem 1 (left) and problem 2 (right) for CAP, Fast CAP, Gaussian processes, MARS, 
CART, the least squares estimator, the linear fitting method of Magnani and Boyd (2009) 
and support vector machines. 


6.2 Predicting Weekly Wages 


We use shape restricted methods to predict mean weekly wages based on years of education and 
experience. The data are from the 1988 Current Population Survey (CPS); they originally appeared 
in Bierens and Ginther (2001) and can be accessed as ex1029 in the Sleuth2 package in R. The 
data set contains 25,361 records of weekly wages for full-time, adult, male workers for 1987, along 
with years experience, years of education, race (either back or white; no others were included in the 
sample), region, and whether the last job held was part time. 


A reasonable assumption for wages is that they are concave in years experience. Each year pre- 
viously worked should have decreasing returns for average wages until peak earnings are reached, 
with modest declines afterwards. Indeed, this pattern is seen in Figure 7 when we compare aver- 
age weekly wages against experience. Wages can also be expected to increase based on education 
level, but not in a concave or convex fashion. However, concavity can be generated with an expo- 
nential transformation of education; this is shown in Figure 7. We therefore used a transformation, 
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Figure 5: (A) The CAP estimator, and (B) the LSE fit to 500 observations drawn from y = A + 
x5 +£, where € ~ N(0,0.257). The covariates were drawn from a 2 dimensional uniform 
distribution, Unif[—1, 1]*. The LSE was truncated at predicted values of 2.5 for display, 
although some predicted values reached as high as 4,800 on [—1, 1]’. 
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Figure 6: Number of observations (log scale) vs. mean squared error (log scale) plus/minus one 
standard error for CAP and treed linear models with local split selection, with no refitting; 
local split selection, with refitting; and global split selection, with no refitting. 
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Figure 8: Root mean squared error (RMSE), left, and runtime in seconds, right, for CAP, Fast CAP, 
CART, MARS, and SVMs for predicting weekly wages based on years experience and 


years education. 


1 .2years education a5 4 covariate. Shape restrictions do not hold with any other covariates, so they are 
discarded. 

We implemented CAP, fast CAP, the linear model of Magnani and Boyd (2009), CART, MARS, 
and SVMs. Due to the problem size, we did not use Gaussian processes or the least squares esti- 
mator. We estimated RMSE through 10-fold cross validation. Results and runtimes are shown in 
Figure 8 for all methods except Magnani and Boyd (2009). This had a RMSE of 10,156, orders of 
magnitude larger than other methods, and was hence omitted from the figures. 
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CAP 385.7 + 20.8 12.8+0.8 
Fast CAP 385.7 + 20.8 1.9+0.2 


CART 489.6 + 26.1 2.6+0.2 


MARS 385.8 + 20.7 150.4+ 12.8 
Magnani and Boyd (2009) | 10156.0 + 9765.2 8.0+0.7 
SVM 388.9 + 20.7 72.9 1.6 


Table 1: Average RMSE and runtime in seconds, plus/minus one standard error for CAP, Fast CAP, 
CART, MARS, Magnani and Boyd (2009) and SVMs. 





This data set presents difficulties for many methods due to its size (n > 20,000) and highly 
skewed distribution. CAP, Fast CAP, MARS and SVMs all had comparable predictive error rates, 
while CART produced error rates about 27% higher. The linear fitting method of Magnani and 
Boyd (2009) occasionally tried to fit outliers with hyperplanes, resulting in about a 2,500% increase 
in predictive error. This potential instability is one of the largest drawbacks with the method of 
Magnani and Boyd (2009). In terms of runtimes, Fast CAP and CAP were both significantly faster 
than any methods that produced comparable results, with runtime reductions of more than 80% over 
SVMs. 


In Figure 9, we compare the predicted functions produced by CAP and SVMs. In areas with 
small amounts of data, such for people with low education, the SVM produces results that do not 
match prior information. In the SVM surface, someone with 0 years of experience and 0 years of 
education is predicted to have about a 150% larger weekly wage than a high school graduate with O 
years of experience and about the same weekly wage as someone with a 4-year college degree and 
0 years of experience. By imposing shape constraints, CAP eliminates these types of problems and 
produces a surface that conforms to prior knowledge. 


Unlike the surface produced by SVM regression, the surface produced by CAP is not smooth. A 
greater degree of smoothness can be added through ensemble methods like bagging (Breiman, 1996) 
and smearing (Breiman, 2000). Averaging randomized convex estimators produces a new convex 
estimator; these methods have been explored for approximating objective functions in Hannah and 
Dunson (2012). A surface produced by smearing CAP is shown on the right in Figure 9. Note that 
its overall shape is quite similar to the original CAP estimator while most of the sharp edges have 
been smoothed away. 


6.3 Pricing Stock Options 


In sequential decision problems, a decision maker takes an action based on a currently observed state 
of the world based on the current rewards of that action and possible future rewards. Approximate 
dynamic programming is a modeling method for such problems based on approximating a value-to- 
go function. Value-to-go functions, or simply “value functions,” give the value for each state of the 
world if all optimal decisions are made subsequently. 


Often value functions are known to be convex or concave in the state variable; this is common 
in options pricing, portfolio optimization and logistics problems. In some situations, such as when 
a linear program is solved each time period to determine an action, a convex value function is 
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Figure 9: Mean weekly wage based on years of experience and years of education (top left), pre- 
dicted values using SVM regression (top right), CAP (bottom left), and smeared CAP 
(bottom right). 


required for computational tractability. Convex regression holds great promise for value function 
approximation in these problems. 


To give a simple example for value function approximation, we consider pricing American 
basket options on the average of M underlying assets. Options give the holder the right—but not 
the obligation—to buy the underlying asset, in this case the average of M individual assets, for a 
predetermined strike price R. In an American option, this can be done at any time between the 
issue date and the maturity date, 7. However, American options are notoriously difficult to price, 
particularly when the underlying asset base is large. 


A popular method for pricing American options uses approximate dynamic programming where 
continuation values are approximated via regression (Carriere, 1996; Tsitsiklis and Van Roy, 1999, 
2001; Longstaff and Schwartz, 2001). We summarize these methods as follows; see Glasserman 
(2004) for a more thorough treatment. The underlying assets are assumed to have the sample path 
{X1,...,Xr}, where X; = {S1 (t),...,Sm(t)} is the set of securities at time t. At each time t, a 
continuation value function, V;(X;), is estimated by regressing a value function for the next time 
period, Vin (X41); on the current state, X;. The continuation value is the value of holding the 
option rather than exercising given the current state of the assets. The value function is defined to 
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be the max of the current exercise value and the continuation value. Options are exercised when the 
current exercise value is greater than or equal to the continuation value. 

The procedure to estimate the continuation values is as follows (as summarized in Glasserman 
2004): 


0. Define basket payoff function, 


h(X;) = mad ESO -rol 


1. Sample N independent paths, {X1;,...,X7j}, J= 1,...,N. 
2. At time T, set Vr (Xrj) = h(Xrj). 
3. Apply backwards induction: fort = T —1,...,1, 


e given {Vjii4 (X41 Dev regress on {X; ne to get continuation value estimates 
{G (Xn) F=: 
e set value function, 
Vi (Xr) = max {h(Xij),Ci(Xj) f- 


We use the value function defined by Tsitsiklis and Van Roy (1999). 

The regression values are used to create a policy that 1s implemented on a test set: exercise 
when the current exercise value is greater than or equal to the estimated continuation value. A good 
regression model is crucial to creating a good policy. 

In previous literature, {C;(X;;) Ha has been estimated by regression splines for a single under- 
lying asset (Carriere, 1996), or least squares linear regression on a set of basis functions (Tsitsiklis 
and Van Roy, 1999; Longstaff and Schwartz, 2001; Glasserman, 2004). Regression on a set of ba- 
sis functions becomes problematic when X;; is defined over moderate to high dimensional spaces. 
Well-defined sets of bases such as radial basis functions and polynomials require an exponential 
number of functions to span the space, while manually selecting basis functions can be quite diffi- 
cult. Since the expected continuation values are convex in the asset price for basket options, CAP is 
a simple, nonparametric alternative to these methods. 

We compared the following methods: CAP and Fast CAP with D = 3, L = 10 for both and 
P’ = min(M, 10), the number of random search directions in Fast CAP; the method of Magnani and 
Boyd (2009); regression trees with constant leaves using the Matlab function classregt ree; least 
squares using the polynomial basis functions 


(1,S,(t),S;(t),8;(t),Si(t)Sj(t),A(%)), i= 1,...,M, J Ais 


ridge regression on the same basis functions with ridge parameter chosen by 10-fold cross-validation 
each time period from values between 107° and 10°. 

We compared value function regression methods as follows. We simulated for both N = 10,000 
and N = 50,000 training samples for a 3-month American basket option with a number of underly- 
ing assets, M, varying between | and 30 using a geometric Brownian motion with a drift of 0.05 and 
a volatility of 0.10. All assets had correlation 0.5 and starting value 100. The option had strike price 
110. Policy values were approximated on 50,000 testing sample paths. An approximate upper bound 
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was generated using the dual martingale methods of Haugh and Kogan (2004) from value functions 
generated using polynomial basis functions based on the mean of the assets, (1,¥,,¥7,Y¥°,h(Y;)), 
where Y, = 1/My“, X;(t), with 2,000 samples. Upper and lower bounds were generated using 5 
training and testing sets. 


Results are displayed in Figure 10. We found that CAP and Fast CAP gave state of the art per- 
formance without the difficulties associated with linear functions, such as choosing basis functions 
and regularization parameters. We observed a decline in the performance of least squares as the 
number of assets grew due to overfitting. Ridge regularization greatly improved the least squares 
performance as the number of assets grew. Tree regression did poorly in all settings, likely due 
to overfitting in the presence of the non-symmetric error distribution generated by the geometric 
Brownian motion. These results suggest that CAP is robust even in less than ideal conditions, such 
as when data have heteroscedastic, non-symmetric error distributions. 


Again, we noticed that while the performances of CAP and Fast CAP were comparable, the 
runtimes were about an order of magnitude different. On the larger problems, runtimes for Fast 
CAP were similar to those for unregularized least squares. This is likely because the number of 
covariates in the least squares regression grew like M’, while all linear regressions in CAP only had 
M covariates. 


7. Conclusions 


In this article, we presented convex adaptive partitioning, a computationally efficient, theoretically 
sound and empirically robust method for regression subject to a convexity constraint. CAP is the 
first convex regression method to scale to large problems, both in terms of dimensions and number 
of observations. As such, we believe that it can allow the study of problems that were once thought 
to be computationally intractable. These include econometrics problems, like estimating consumer 
preference or production functions in multiple dimensions, approximating complex constraint func- 
tions for convex optimization, or creating convex value-to-go functions or response surfaces that 
can be easily searched in stochastic optimization. 


CAP can be extended in a number of ways. First, as demonstrated in Hannah and Dunson 
(2012), CAP can be used in an ensemble setting—like bagging or smearing—to produce a smoother 
estimator. Averages of piecewise linear estimators are particularly useful in an optimization setting. 
They are still piecewise linear and can be searched by a linear program, but have more stable min- 
imum locations than sparse methods like CAP and the linear fitting method of Magnani and Boyd 
(2009). Second, CAP can be extended to more shape constrained settings, like monotone, concave 
and semi-convex functions. Monotone, concave functions are concave functions with increasing 
slopes, which are common in economics. Although CAP is not a general purpose shape constrained 
inference method, a variant for monotone functions can easily be generated by placing positivity 
constraints on the parameters of the linear models. Semi-convex functions are convex in some 
dimensions but unconstrained in others. Variants of CAP could be combined with other nonpara- 
metric methods like kernels to produce efficient inference methods for partially convex functions. 
We believe that the methods supporting the CAP algorithm can bring efficient, theoretically sound 
inference to a variety of shape constrained problems that are inapproachable with traditional meth- 
ods. 
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Average values (top) and runtimes (bottom) plus/minus one standard error for pricing 
America basket options as a function of the number of underlying assets are shown for 
CAP, Fast CAP, tree regression with constant leaves (CART), the method of Magnani 
and Boyd (2009), least squares (LS) and ridge regularized least squares value function 
approximation. 
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8. Online Supplements 


Code for CAP can be downloaded at http: //www.columbia.edu/~1lah2178/Research.html and 
as an online supplement at the JMLR website. 
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Appendix A. 


The proof of Theorem | is essentially identical to the proof of Theorem | of Chaudhuri et al. 
(1994) with a few modifications. The Chaudhuri et al. (1994) results are for an algorithm that splits 
subsections parallel to axes. By allowing subsets to be determined by the dominating hyperplanes, 
the subsets are now polyhedral with a maximal number of faces determined by the dimension and 
maximal number of subsets. To show this, we modify Lemma 12.27 of Breiman et al. (1984). 
Proof [Proof of Theorem 1] It is sufficient to show that 


aij Z 
Nes ee dik | Ox Bal = A(Xx)| > 0 


jah n 


in probability, where A(X;) is the vector of elements [fo (xx, d7! a {Xpand | Ka fool. Let 
Pp 


a? = Qk — Bi Xx. Assumption A3. ensures that the matrices Dx for all subsets are nonsingular with 
probability tending to 1, where Dy = Yy.cc, Fil. Letting Y; = fo(x;) + €;, we have 


a, Be = Dy È Vifo(xi)+Dz' Y Tie (6) 


X;ECk X;ECk 


for k =1,...,K, with probability tending towards |. Doing a Taylor expansion of Equation (6), we 
get 
(a, Bal = A(Xx) = D7’ ` T;r(Xi = X) +D;," X lE 


IEC LEC; 


where r(x; — X,) are the second order and above terms of the Taylor expansion of fo(X,). Assump- 
tions A1., A3. and A4. ensure max,=1..._K, aD Liec, Vir(xi — Xx) | — 0 in probability as n > œ. 
To bound the random error term of Equation (6), we first assume that A; is fixed. Applying Lemma 
12.26 of Breiman et al. (1984) to each component of a, |Gul Ye [;€;, there exist constants 
hı > 0, m > 0 and Yo > O such that 


P (ax 


whenever Y < Yo. Modifying Lemma 12.27 of Breiman et al. (1984) to account for the greater VC 
dimension of the subsets, we note A5. bounds the number of polyhedral faces for each subset to be 





IG’ Y Te; 


icC; 








> r) < jy e72 (7) 
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K,(p +2). Following the proof of 12.27, for m, such that m,/log(n) — œ, we use Equation (7) to 
show 


P (| [oxy Bx’ B A(Xx) | Z YlXi:n) < hye T naala, 
< hye etm log(n) 
— hyn 2Y" 
on the event d?,|C;|/n > my, log(n)/n. Using the VC dimension of the partition, 
P (| foe, By)’ — A(Xx) | > y for each Ax and d-,\Ci| > mDlog(n)) 


< fy?! +Kn(P+2) pKa(p+2) -hmn 


Since A5. ensures that m, — œ, the result holds. 
= 


Proof [Proof of Theorem 2] Fix € > 0; let dx be the diameter of X. Choose N such that for every 


n>N 
€ 
P4 max suplar +Bix-— fo(x)| > —— > <e/2, 
f pas, sp i ( ) Cdx / 
€ 
P4 max sup ||Bk -YPI > = ? <&/2. 
k=1,....KnxeA, Cdx 
Fix a 6 net over X such that at least one point of the net sits in A; for each k = 1,...,K. Let ng be 


the number of points in the net and let x? be a point. Then, 


max + Bix — fo(x) 


= s hed i n 


ah 








P f sup fn(X) — fo(x)| > e} =P {sup 


xEX 














€ 
<P{ max max O% + BF x? — fo (x?) = =}. 
i=1,...,.ng |k=1,...,Kn C 
<2 T ő § € 
<P (o BE) ge AD SE, 
= =n L k + By X; {x9 Ay} fo(x;) Cay 
<E. 


Appendix B. 


The CAP algorithm offers two main computational bottlenecks. First, it searches over all cardinal 
directions, and only cardinal directions, to produce candidate models. Second, it keeps generating 
models until no subsets can be split without one having less than the minimum number of observa- 
tions. In most cases, the optimal number of components is much lower than the terminal number of 
components. 
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To alleviate the first problem, we suggest using P’ random projections as a basis for search. 
Using ideas similar to compressive sensing, each projection g; ~ N,(0,/) for j= 1,...,P’. Then we 
search along the direction gi Xx rather than x;. When we expect the true function to live in a lower 
dimensional space, as is the case with superfluous covariates, we can set P’ < p. 

We solve the second problem by modifying the stopping rule. Instead of fully growing the tree 
until each subset has less than 2n/(2log(n)) observations, we use generalized cross-validation. We 
grow the tree until the generalized cross-validation value has increased in two consecutive iterations 
or each subset has less than 2n/(2log(n)) observations. As the generalized cross-validation error is 
usually concave in K, this heuristic often offers a good fit at a fraction of the computational expense 
of the full CAP algorithm. 

The Fast CAP algorithm has the potential to substantially reduce the log(n)” factor by halting 
the model generation long before K reaches Dlog(n). Since every feasible partition is searched for 
splitting, the computational complexity grows as k gets larger. 

The Fast CAP algorithm is summarized as follows. 


B.1 Fast Convex Adaptive Partitioning (Fast CAP) 
1. Initialize. As in CAP. 


2. Split. 


a. Generate candidate splits. Generate candidate model M jke by 1) fixing a subset k, 2) 
generating a random direction j with g; ~ N,,(0,/), and 3) dividing the data as follows: 


e set rae = min{ gj x; Lhe Crh. ee = max {g} x; TE Ci} and Dike E aorin + (1 > 


ik 
ae) Xinax 
e set 
nP T bad T 
C; = fi LE Ck, 8; Xi < bike}, Cki = fi LE Ck, 8j x; > bike}, 
T ; T 
Aj, = {X : X E€ Ak, gj X < bje}, As) = {X : X E Ak, Zj X > bjr}. 


Then new hyperplanes are fit to each of the new subsets. This is done for L knots, P’ 
dimensions and K subsets. 


b. Select split. As in CAP. 
3. Refit. As in CAP. 


4. Stopping conditions. Let GCV (Mx) be the generalized cross-validation error for model Mx. 
Stop if GCV (Mg) > GCV(Mx_) and GCV (Mg—1) > GCV(Mx_2) of if |Ck| < 2nmin for 
k= 1,...,K. Then select final model as in CAP. 


Appendix C. 


In this subsection, we empirically examine the effects of the two tunable parameters, the log factor, 
D, and the number of knots, L. The log factor controls the minimal number of elements in each 
subset by setting |C,| > n/(Dlog(n)), and hence it controls the number of subsets, K, at least for 
large enough n. Increasing D allows the potential accuracy of the estimator to increase, but at the 
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Figure 11: Log factor D (log scale) vs. mean squared error (log scale) for CAP and Fast CAP (top). 
Log factor D (log scale) vs. runtime in seconds (log scale) (bottom). Both methods were 
run on problem 1 (left) and problem 2 (right) with n = 500 and n = 5,000. Lines are 
mean value and shading represents one standard error. 


cost of greater computational time due to the increase in possible values for K and the larger number 
of possibly admissible sets generated in the splitting step of CAP. 


We compared values for D ranging from 0.1 to 20 on problems 1 and 2 with sample sizes of 
n = 500 and n = 5,000 over 100 training sets and one testing set. Results are displayed in Figure 
11. Note that error may not be strictly decreasing with D because different subsets are proposed 
under each value. Additionally, Fast CAP is a randomized algorithm so variance in error rate and 
runtime is to be expected. 


Empirically, once D > 1, there was little substantive error reduction in the models, but the 
runtime increased as O(D”) for the full CAP algorithm. Since D controls the maximum partition 
size, K, = Dlog(n), and a linear regression is fit K log(K ) times, the expected increase in the runtime 
should only be O(Dlog(D)). We believe that the extra empirical growth comes from an increased 
number of feasible candidate splits. In the Fast CAP algorithm, which terminates after generalized 
cross-validation gains cease to be made, we see runtimes leveling off with higher values of D. Based 
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Figure 12: Number of knots L (log scale) vs. mean squared error (log scale) for CAP and Fast CAP 
(top). Number of knots L (log scale) vs. runtime in seconds (log scale) (bottom). Both 
methods were run on problem 1 (left) and problem 2 (right) with n = 500 and n = 5,000. 
Lines are mean value and shading represents one standard error. 


on these results, we believe that setting D = 3 offers a good balance between fit and computational 
expense. 


The number of knots, L, determines how many possible subsets will be examined during the 
splitting step. Like D, an increase in L offers a better fit at the expense of increased computation. 
We compared values for L ranging from 1 to 50 on problems 1 and 2 with sample sizes of n = 500 
and n = 5,000 over 100 training sets and 1 testing set. Results are displayed in Figure 12. 


The changes in fit and runtime are less dramatic with L than they are with D. After L = 3, the 
predictive error rates almost completely stabilized. Runtime increased as O(L) as expected. Due to 
the minimal increase in computation, we feel that L = 10 1s a good choice for most settings. 
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Figure 13: Number of observations n (log scale) vs. square root of mean squared error (log scale) 
for problem 1 (A) and problem 2 (B). Linear models are fit to find the empirical rate of 
convergence. 


Appendix D. 


Although theoretical rates of convergence are not yet available for CAP, we are able to empirically 
examine them. Rates of convergence for multivariate convex regression have only been studied 
in two articles of which we are aware. Aguilera et al. (2011) studied rates of convergence for an 
estimator that 1s created by first smoothing the data, then evaluating the smoothed data over an €-net, 
and finally convexifying the net of smoothed data by taking the convex hull. They showed that the 
convexify step preserved the rates of the smoothing step. For most smoothing algorithms, these are 
minimax nonparametric rates, n pe with respect to the empirical 22 norm. 

Hannah and Dunson (2011) showed adaptive rates for a Bayesian model that places a prior over 
the set of all piecewise linear functions. Specifically, they showed that if the true mean function fo 
actually maps a d-dimensional linear subspace of X to R, that is 


fo(X) = go(xA), A E R?*¢, 


then their model achieves rates of log~!(n)n7 72 with respect to the empirical 42 norm. Empirically, 
we see these types of adaptive rates with CAP. 

In Figure 13, we plotted the number of observations against the square root of the mean squared 
error in a log-log plot for problems | and 2. We then fitted a linear model for both CAP and Fast 
CAP. For problem 1, p = 5 but d = 3, due to the sum in the quadratic term. Likewise, for problem 2, 
p = 10 but d = 1 because it is an exponential of a linear combination. Under standard nonparametric 
rates, we would expect the slope of the linear model to be —4 for problem | and — i for problem 
2. However, we see slopes closer to —i and -i for problems 1 and 2, respectively; values are given 
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Method Problem 2 
Expected: Rates in p | —0.1429 —0.0833 
Expected: Rates in d —0.3333 
Empirical: CAP 


—0.2003 —0.2919 
Empirical: Fast CAP | —0.2234 —0.2969 


Table 2: Slopes for linear models fit to log(n) vs. log(V MSE) in Figure 13. Expected slopes 
are given when: 1) rates are with respect to full dimensionality, p, and 2) rates are with 
respect to dimensionality of linear subspace, d. Empirical slopes are fit to mean squared 
error generated by CAP and Fast CAP. Note that all empirical slopes are closest to those 
for linear subspace rates rather than those for full dimensionality rates. 





in Table 2. These results strongly imply that CAP achieves adaptive convergence rates of the type 
shown by Hannah and Dunson (2011) for problems 1 and 2. 
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